Hello everyone, welcome back to the Heterogeneous Parallel Programming class. This is Lecture 1.2: Introduction to Heterogeneous Parallel Computing. The objective of this lecture is for you to learn the major differences between latency-oriented devices and throughput-oriented devices. Latency devices today means CPU cores, and throughput devices today means GPU cores. You will also learn how and why winning applications are increasingly using both types of devices.

This slide shows a typical system on a chip, or SoC, for mobile phones today. In a typical mobile phone we expect to see between two and four CPU cores, or latency-oriented cores. We also expect to see two to four throughput-oriented cores, or GPU cores. There have also always been a number of hardware intellectual property blocks, or hardware IP blocks, that are used for video decoding, sound processing, and so on, and these IP blocks are becoming more and more programmable as well. Then we have the digital signal processing cores, or DSP cores, and these cores are increasingly being used for computing. There is also an emerging trend of using configurable logic, or configurable cores, to make these SoCs more versatile in terms of computing needs. Finally, we have an increasing amount of on-chip memory that is used by all these different types of cores to minimize their consumption of DRAM access bandwidth. On top of all these different cores and memories there is also a cloud service path, where applications can choose to use cloud services rather than computing on the local cores, if parts of the application are too demanding for the local cores to handle. So in this environment, applications need to deal with a heterogeneous collection of cores, and that is really at the heart of what we are teaching in this class.

The mobile phone domain is not the only domain that has been using heterogeneous parallel computing. In the supercomputing domain we see very similar trends, where the top 500 supercomputers today derive more than half of their computing power from GPU cores. So we are definitely in a time where heterogeneous parallel computing is becoming increasingly important.
The next few slides will present the fundamental reasons why heterogeneous parallel computing is an important trend for applications.

This slide shows that CPUs and GPUs are designed very differently. CPUs are designed as latency-oriented cores, and we show a simplified picture of them on the left-hand side. We show a correspondingly simplified picture of GPUs, or throughput-oriented cores, on the right-hand side. As a higher-level comparison before going into details in the next couple of slides, I want you to remember that CPUs tend to have larger local caches, and GPUs tend to have smaller caches, or local memories. GPUs tend to have a larger number of registers, mostly to support a large number of threads, whereas CPUs tend to have fewer registers and support a smaller number of threads. GPUs tend to have a larger number of SIMD execution units, and CPUs have fewer of those SIMD units. CPUs tend to have sophisticated control logic, and therefore there is a big purple block labelled as control, whereas GPUs tend to have very simple control but a large number of threads to manage. So we have a purple block on the right-hand side that shows a large amount of logic to manage and schedule threads and so on.

Let's go a little bit deeper into the design of each device. As mentioned, CPUs are designed with a latency-oriented philosophy, and in CPU design we typically see three prominent features. The first is that the ALUs, or arithmetic logic units, are designed as powerful logic that generates numerical arithmetic results in very few clock cycles. In typical CPU cores today, for 64-bit double-precision floating-point operations, the addition and multiplication operations on these double-precision values typically take only between one and three clock cycles. These clocks also run at very high frequencies, roughly 1.5 to 3 gigahertz. So in terms of the pure time latency of an arithmetic operation, CPUs have extremely short latency in producing these floating-point arithmetic values.
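As a rough back-of-the-envelope illustration of what those numbers mean in wall-clock time, here is a minimal sketch. The 3 GHz clock and the three-cycle latency are just the round figures mentioned in this lecture, not measurements of any particular CPU.

```cuda
// Minimal sketch: turning "1-3 cycles at roughly 3 GHz" into nanoseconds.
// The figures below are the illustrative round numbers from the lecture.
#include <stdio.h>

int main(void) {
    double clock_ghz = 3.0;                      // assumed clock frequency
    double cycle_time_ns = 1.0 / clock_ghz;      // ~0.33 ns per cycle
    double add_latency_cycles = 3.0;             // upper end of the 1-3 cycle range
    printf("one cycle : %.2f ns\n", cycle_time_ns);
    printf("one add   : %.2f ns\n", add_latency_cycles * cycle_time_ns);  // ~1 ns
    return 0;
}
```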
The next prominent feature is that CPUs tend to have large caches, and these large caches are designed to convert long-latency memory accesses into short-latency cache accesses. The strategy is to keep as many data elements as possible in the caches, so that whenever any of the CPU execution units needs to access data, chances are it will find the data in the cache from a previous access. The access latency to this data can therefore be greatly reduced.

The third feature is sophisticated control, and this control logic usually manifests itself in two forms. One is branch prediction support to reduce branch latency. Branch instructions are generated from high-level language constructs such as if-then-else statements and loops. The decision to pick one of the paths of an if-then-else construct, or the decision to determine whether a loop should iterate again or exit, is realized with these control instructions. In most modern microprocessors, the actual decisions may take a long time to make. So having the capability for the hardware to predict which way a branch instruction will go allows us to immediately fetch and execute the instructions in the predicted path. However, we have to allow for the cases where the prediction fails. If the prediction turns out to be incorrect, we need to keep enough of the processor state to recover from the incorrect prediction, and therefore we need all these resources in the control logic to support branch prediction: making the prediction as well as recovering from incorrect predictions. We also have a lot of control logic that is dedicated to data forwarding, which takes care of the situation where the output result of one instruction is needed by subsequent instructions. The data forwarding logic determines where those instructions are in the pipeline and routes the result to them as quickly as possible. This involves a large amount of comparison circuitry as well as routing circuitry.
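As a concrete illustration of the kinds of constructs that exercise this control logic, here is a small, hypothetical C-style function; the name and constants are made up for illustration. The loop test and the if-then-else each become branch instructions the hardware tries to predict, and the result of the multiply feeds directly into the following add, which is exactly the producer-consumer pattern that the data forwarding logic short-circuits.

```cuda
// Hypothetical host-side code, only to show which constructs generate
// branches and data dependences; the function name and constants are made up.
double saturating_sum(const double *x, int n, double limit) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {        // loop test -> a branch the CPU predicts
        double scaled = 2.0 * x[i];      // producer: result of the multiply...
        sum += scaled;                   // ...consumer: forwarded straight into the add
        if (sum > limit) {               // if-then-else -> another predicted branch
            sum = limit;
        }
    }
    return sum;
}
```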
All of this prediction and forwarding circuitry can take a lot of chip area as well as execution power. The bottom line is that all these mechanisms exist to reduce the latency of operations: the first reduces arithmetic calculation latency, the second reduces memory access latency, and the third reduces branch decision latency as well as the latency between the generation of a data value and the execution of all the instructions that require that value as input.

GPUs, on the other hand, are designed as throughput-oriented devices, and GPUs tend to have very small caches. These caches are not used to retain data for future accesses; rather, they are designed as staging units for a large number of threads. If many simultaneously executing threads require the same data, the cache will consolidate all of these requests into one, so that a single request goes to DRAM. When the data comes back, the cache controller serves as forwarding logic that distributes the data to all the execution units, or threads, that require it. So the latency of going to DRAM will still be there; however, this logic allows multiple accesses to be consolidated into a smaller number, so that we conserve the traffic going into DRAM. The second feature is simple control, shown as the small yellow boxes on the left-hand side: in GPUs we typically have no branch prediction, and no data forwarding or very little of it. The third feature is energy-efficient ALUs. Instead of building a small number of very powerful ALUs with very short latency, GPUs usually come with a large number of long-latency but very power-efficient, low-power ALUs. These ALUs are usually heavily pipelined so that they can accept one operation every clock cycle, even though each individual operation takes a long time to produce its result. So at the output of an ALU you will still see one result coming out of the pipeline every clock cycle. Because we have so many ALUs, each with such a long pipeline, we need a large number of threads to keep them all busy.
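To make the "one operation per thread" idea concrete, here is a minimal sketch of the kind of kernel this design favors; the kernel name and parameters are made up for illustration. Each of a very large number of threads performs just one multiply-add, and it is the sheer number of in-flight threads, not the speed of any single one, that keeps the deeply pipelined ALUs producing a result every cycle.

```cuda
// A minimal, hypothetical CUDA kernel: one multiply-add per thread.
// With millions of elements, enough threads are in flight to keep the
// long ALU pipelines full even though each individual operation is slow.
__global__ void scale_and_add(const float *x, const float *y,
                              float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        out[i] = a * x[i] + y[i];                   // one arithmetic operation per thread
    }
}
```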
Each thread contributes an arithmetic operation that can be executing in one stage of an ALU pipeline, so that we can fully utilize all of the hardware. That is the reason these devices are designed in such a way that, if we have a massive number of threads, each one contributing some operation at the same time, we can fully utilize this large number of ALUs and produce results at very high throughput. However, every thread will take a long time to execute, and every operation will take much longer to execute than its CPU counterpart.

Now that we understand the design philosophy of CPU cores and GPU cores, we should be able to understand why the winning applications today need to use both CPUs and GPUs. We use CPUs to execute the sequential parts of the application, where latency really matters, because in the sequential parts there are going to be very few operations that can be done in parallel. We just need to execute those few operations very quickly, so that we can get through these sections quickly. On the other hand, we use GPUs for the parallel parts, where throughput wins. In those parallel parts we have lots and lots of operations that can be done in parallel, so we use those operations to fully utilize the large number of ALUs in the GPU and execute through those parallel sections extremely quickly. Whenever we use GPUs to execute these parallel sections, they can easily achieve ten or more times higher performance than CPUs. And obviously, CPUs can easily achieve ten or more times the performance of GPUs on the sequential parts of the application.

So naturally, a large number of application domains have already observed this phenomenon, and many of them are using both CPUs and GPUs. In the field, we typically refer to computing that uses both CPUs and GPUs as GPU computing, and this is definitely a growing field.
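A minimal sketch of this division of labor, assuming the hypothetical scale_and_add kernel shown earlier lives in the same source file (error checking omitted, sizes and names made up): the host CPU performs the sequential, latency-sensitive setup, while the data-parallel loop is handed to the GPU as a kernel launch over many threads.

```cuda
// Hypothetical host code sketching the CPU/GPU division of labor.
// The CPU handles the sequential setup; the GPU executes the parallel part.
#include <cuda_runtime.h>

void run_parallel_part(const float *h_x, const float *h_y,
                       float *h_out, float a, int n) {
    size_t bytes = n * sizeof(float);
    float *d_x, *d_y, *d_out;

    // Sequential, latency-sensitive work (allocation, copies) stays on the CPU.
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // The throughput-oriented part: launch enough threads to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale_and_add<<<blocks, threadsPerBlock>>>(d_x, d_y, d_out, a, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_out);
}
```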
In 2010, I edited two volumes of GPU Computing Gems, which included 90 articles on successful uses of GPU computing in different application fields, as well as some of the tools and environments that help application developers use heterogeneous computing successfully in their applications. The list of domains that have successfully used heterogeneous parallel computing is definitely growing; the fields covered by the GPU Computing Gems volumes include financial analysis, scientific simulation, engineering simulation, data-intensive analysis, medical imaging, digital audio processing, digital video processing, computer vision, biomedical informatics, electronic design automation, statistical modeling, numerical methods, ray tracing, rendering, and interactive physics. As you can see, many of these applications are heavily used in our daily lives.

This concludes our lecture on Introduction to Heterogeneous Parallel Computing. At this point you should understand the very different nature of CPUs and GPUs, and this will be an important foundation as we talk about parallel programming of throughput-oriented devices in the next few weeks. If you would like to learn more about the nature of CPUs versus GPUs and some of the more subtle insights about heterogeneous parallel computing, I would encourage you to read chapter one of the textbook by David Kirk and Wen-mei Hwu. Thank you.